The objective of this project is to perform exploratory data analysis on NHL player statistics, salary, draft year, and demographic information. Later, we will predict player salaries from on-ice statistics and evaluate the prediction models for best fit.
library(tidyverse)
library(psych)
library(Hmisc)
library(dplyr)
library(plotly)
library(car)
Loading required package: carData
Attaching package: ‘car’
The following object is masked from ‘package:psych’:
logit
The following object is masked from ‘package:dplyr’:
recode
The following object is masked from ‘package:purrr’:
some
getwd()
[1] "/Users/sarahmcdonald/Desktop"
#setwd('/Users/sarahmcdonald/Downloads')
nhl<- read.csv('train.csv')
head(nhl,3)
The following is a column legend from https://www.kaggle.com/datasets/camnugent/predict-nhl-player-salaries?resource=download and hockeyabstract.com
colnames(nhl)
[1] "Salary" "Born" "City" "Pr.St" "Cntry" "Nat" "Ht" "Wt"
[9] "DftYr" "DftRd" "Ovrl" "Hand" "Last.Name" "First.Name" "Position" "Team"
[17] "GP" "G" "A" "A1" "A2" "PTS" "X..." "E..."
[25] "PIM" "Shifts" "TOI" "TOIX" "TOI.GP" "TOI.GP.1" "TOI." "IPP."
[33] "SH." "SV." "PDO" "F.60" "A.60" "Pct." "Diff" "Diff.60"
[41] "iCF" "iCF.1" "iFF" "iSF" "iSF.1" "iSF.2" "ixG" "iSCF"
[49] "iRB" "iRS" "iDS" "sDist" "sDist.1" "Pass" "iHF" "iHF.1"
[57] "iHA" "iHDf" "iMiss" "iGVA" "iTKA" "iBLK" "iGVA.1" "iTKA.1"
[65] "iBLK.1" "BLK." "iFOW" "iFOL" "iFOW.1" "iFOL.1" "FO." "X.FOT"
[73] "dzFOW" "dzFOL" "nzFOW" "nzFOL" "ozFOW" "ozFOL" "FOW.Up" "FOL.Up"
[81] "FOW.Down" "FOL.Down" "FOW.Close" "FOL.Close" "OTG" "X1G" "GWG" "ENG"
[89] "PSG" "PSA" "G.Bkhd" "G.Dflct" "G.Slap" "G.Snap" "G.Tip" "G.Wrap"
[97] "G.Wrst" "CBar" "Post" "Over" "Wide" "S.Bkhd" "S.Dflct" "S.Slap"
[105] "S.Snap" "S.Tip" "S.Wrap" "S.Wrst" "iPenT" "iPenD" "iPENT" "iPEND"
[113] "iPenDf" "NPD" "Min" "Maj" "Match" "Misc" "Game" "CF"
[121] "CA" "FF" "FA" "SF" "SA" "xGF" "xGA" "SCF"
[129] "SCA" "GF" "GA" "RBF" "RBA" "RSF" "RSA" "DSF"
[137] "DSA" "FOW" "FOL" "HF" "HA" "GVA" "TKA" "PENT"
[145] "PEND" "OPS" "DPS" "PS" "OTOI" "Grit" "DAP" "Pace"
[153] "GS" "GS.G" "Salary_scale"
Acronym - Meaning
%FOT - Percentage of all on-ice faceoffs taken by this player.
+/- - Plus/minus
1G - First goals of a game
A/60 - Events Against per 60 minutes, defaults to Corsi, but can be set to another stat
A1 - First assists, primary assists
A2 - Second assists, secondary assists
BLK% - Percentage of all opposing shot attempts blocked by this player
Born - Birth date
C.Close - A player shot attempt (Corsi) differential when the game was close
C.Down - A player shot attempt (Corsi) differential when the team was trailing
C.Tied - A player shot attempt (Corsi) differential when the team was tied
C.Up - A player shot attempt (Corsi) differential when the team was in the lead
CA - Shot attempts allowed (Corsi, SAT) while this player was on the ice
Cap Hit - The player’s cap hit
CBar - Crossbars hit
CF - The team’s shot attempts (Corsi, SAT) while this player was on the ice
CF.QoC - A weighted average of the Corsi percentage of a player’s opponents
CF.QoT - A weighted average of the Corsi percentage of a player’s linemates
CHIP - Cap Hit of Injured Player is games lost to injury multiplied by cap hit per game
City - City of birth
Cntry - Country of birth
DAP - Disciplined aggression proxy, which is hits and takeaways divided by minor penalties
DFA - Dangerous Fenwick against, which is on-ice unblocked shot attempts weighted by shot quality
DFF - Dangerous Fenwick for, which is on-ice unblocked shot attempts weighted by shot quality
DFF.QoC - Quality of Competition metric based on Dangerous Fenwick, which is unblocked shot attempts weighted for shot quality
DftRd - Round in which the player was drafted
DftYr - Year drafted
Diff - Events for minus events against, defaults to Corsi, but can be set to another stat
Diff/60 - Events for minus events against, per 60 minutes, defaults to Corsi, but can be set to another stat
DPS - Defensive point shares, a catch-all stat that measures a player’s defensive contributions in points in the standings
DSA - Dangerous shots allowed while this player was on the ice, which is rebounds plus rush shots
DSF - The team’s dangerous shots while this player was on the ice, which is rebounds plus rush shots
DZF - Shifts this player has ended with a defensive zone faceoff
dzFOL - Faceoffs lost in the defensive zone
dzFOW - Faceoffs won in the defensive zone
dzGAPF - Team goals allowed after faceoffs taken in the defensive zone
dzGFPF - Team goals scored after faceoffs taken in the defensive zone
DZS - Shifts this player has started with a defensive zone faceoff
dzSAPF - Team shot attempts allowed after faceoffs taken in the defensive zone
dzSFPF - Team shot attempts taken after faceoffs taken in the defensive zone
E+/- - A player’s expected +/-, based on his team and minutes played
ENG - Empty-net goals
Exp dzNGPF - Expected goal differential after faceoffs taken in the defensive zone, based on the number of them
Exp dzNSPF - Expected shot differential after faceoffs taken in the defensive zone, based on the number of them
Exp ozNGPF - Expected goal differential after faceoffs taken in the offensive zone, based on the number of them
Exp ozNSPF - Expected shot differential after faceoffs taken in the offensive zone, based on the number of them
F.Close - A player unblocked shot attempt (Fenwick) differential when the game was close
F.Down - A player unblocked shot attempt (Fenwick) differential when the team was trailing
F.Tied - A player unblocked shot attempt (Fenwick) differential when the team was tied
F.Up - A player unblocked shot attempt (Fenwick) differential when the team was in the lead. Not the best acronym.
F/60 - Events For per 60 minutes, defaults to Corsi, but can be set to another stat
FA - Unblocked shot attempts allowed (Fenwick, USAT) while this player was on the ice
FF - The team’s unblocked shot attempts (Fenwick, USAT) while this player was on the ice
First Name -
FO% - Faceoff winning percentage
FO%vsL - Faceoff winning percentage against lefthanded opponents
FO%vsR - Faceoff winning percentage against righthanded opponents
FOL - The team’s faceoff losses while this player was on the ice
FOL.Close - Faceoffs lost when the score was close
FOL.Down - Faceoffs lost when the team was trailing
FOL.Up - Faceoffs lost when the team was in the lead
FovsL - Faceoffs taken against lefthanded opponents
FovsR - Faceoffs taken against righthanded opponents
FOW - The team’s faceoff wins while this player was on the ice
FOW.Close - Faceoffs won when the score was close
FOW.Down - Faceoffs won when the team was trailing
FOW.Up - Faceoffs won when the team was in the lead
G - Goals
G.Bkhd - Goals scored on the backhand
G.Dflct - Goals scored with deflections
G.Slap - Goals scored with slap shots
G.Snap - Goals scored with snap shots
G.Tip - Goals scored with tip shots
G.Wrap - Goals scored with a wraparound
G.Wrst - Goals scored with a wrist shot
GA - Goals allowed while this player was on the ice
Game - Game Misconduct penalties
GF - The team’s goals while this player was on the ice
GP - Games Played
Grit - Defined as hits, blocked shots, penalty minutes, and majors
GS - The player’s combined game score
GS/G - The player’s average game score
GVA - The team’s giveaways while this player was on the ice
GWG - Game-winning goals
HA - The team’s hits taken while this player was on the ice
Hand - Handedness
HF - The team’s hits thrown while this player was on the ice
HopFO - Opening faceoffs taken at home
HopFOW - Opening faceoffs won at home
Ht - Height
iBLK - Shots blocked by this individual
iCF - Shot attempts (Corsi, SAT) taken by this individual
iDS - Dangerous shots taken by this player, the sum of rebounds and shots off the rush
iFF - Unblocked shot attempts (Fenwick, USAT) taken by this individual
iFOL - Faceoff losses by this individual
iFOW - Faceoff wins by this individual
iGVA - Giveaways by this individual
iHA - Hits taken by this individual
iHDf - The difference in hits thrown by this individual minus those taken
iHF - Hits thrown by this individual
iMiss - Individual shots taken that missed the net.
Injuries - List of types of injuries incurred, if any
iPEND - Penalties drawn by this individual
iPenDf - The difference in penalties drawn minus those taken
iPENT - Penalties taken by this individual
IPP% - Individual points percentage, which is on-ice goals for which this player had the goal or an assist
iRB - Rebound shots taken by this individual
iRS - Shots off the rush taken by this individual
iSCF - All scoring chances taken by this individual
iSF - Shots on goal taken by this individual
iTKA - Takeaways by this individual
ixG - Expected goals (weighted shots) for this individual, which is shot attempts weighted by shot location
Last Name -
Maj - Major penalties taken
Match - Match penalties
MGL - Games lost due to injury
Min - Minor penalties taken
Misc - Misconduct penalties
Nat - Nationality
NGPF - Net Goals Post Faceoff. A differential of all goals within 10 seconds of a faceoff, relative to expectations set by the zone in which they took place
NHLid - NHL player id useful when looking at the raw data in game files
NMC - What kind of no-movement clause this player’s contract has, if any
NPD - Net Penalty Differential is the player’s penalty differential relative to a player of the same position with the same ice time per manpower situation
NSPF - Net Shots Post Faceoff. A differential of all shot attempts within 10 seconds of a faceoff, relative to expectations set by the zone in which they took place
NZF - Shifts this player has ended with a neutral zone faceoff
nzFOL - Faceoffs lost in the neutral zone
nzFOW - Faceoffs won in the neutral zone
nzGAPF - Team goals allowed after faceoffs taken in the neutral zone
nzGFPF - Team goals scored after faceoffs taken in the neutral zone
NZS - Shifts this player has started with a neutral zone faceoff
nzSAPF - Team shot attempts allowed after faceoffs taken in the neutral zone
nzSFPF - Team shot attempts taken after faceoffs taken in the neutral zone
OCA - Shot attempts allowed (Corsi, SAT) while this player was not on the ice
OCF - The team’s shot attempts (Corsi, SAT) while this player was not on the ice
ODZS - Defensive zone faceoffs that occurred without this player on the ice
OFA - Unblocked shot attempts allowed (Fenwick, USAT) while this player was not on the ice
OFF - The team’s unblocked shot attempts (Fenwick, USAT) while this player was not on the ice
OGA - Goals allowed while this player was not on the ice
OGF - The team’s goals while this player was not on the ice
ONZS - Neutral zone faceoffs that occurred without this player on the ice
OOZS - Offensive zone faceoffs that occurred without this player on the ice
OpFO - Opening faceoffs taken
OpFOW - Opening faceoffs won
OppCA60 - A weighted average of the shot attempts (Corsi, SAT) the team allowed per 60 minutes of a player’s opponents
OppCF60 - A weighted average of the shot attempts (Corsi, SAT) the team generated per 60 minutes of a player’s opponents
OppFA60 - A weighted average of the unblocked shot attempts (Fenwick, USAT) the team allowed per 60 minutes of a player’s opponents
OppFF60 - A weighted average of the unblocked shot attempts (Fenwick, USAT) the team generated per 60 minutes of a player’s opponents
OppGA60 - A weighted average of the goals the team allowed per 60 minutes of a player’s opponents
OppGF60 - A weighted average of the goals the team scored per 60 minutes of a player’s opponents
OppSA60 - A weighted average of the shots on goal the team allowed per 60 minutes of a player’s opponents
OppSF60 - A weighted average of the shots on goal the team generated per 60 minutes of a player’s opponents
OPS - Offensive point shares, a catch-all stat that measures a player’s offensive contributions in points in the standings
OSA - Shots on goal allowed while this player was not on the ice
OSCA - Scoring chances allowed while this player was not on the ice
OSCF - The team’s scoring chances while this player was not on the ice
OSF - The team’s shots on goal while this player was not on the ice
OTF - Shifts this player started with an on-the-fly change
OTG - Overtime goals
OTOI - The amount of time this player was not on the ice.
Over - Shots that went over the net
Ovrl - Where the player was drafted overall
OxGA - Expected goals allowed (weighted shots) while this player was not on the ice, which is shot attempts weighted by location
OxGF - The team’s expected goals (weighted shots) while this player was not on the ice, which is shot attempts weighted by location
OZF - Shifts this player has ended with an offensive zone faceoff
ozFO - Faceoffs taken in the offensive zone
ozFOL - Faceoffs lost in the offensive zone
ozFOW - Faceoffs won in the offensive zone
ozGAPF - Team goals allowed after faceoffs taken in the offensive zone
ozGFPF - Team goals scored after faceoffs taken in the offensive zone
OZS - Shifts this player has started with an offensive zone faceoff
ozSAPF - Team shot attempts allowed after faceoffs taken in the offensive zone
ozSFPF - Team shot attempts taken after faceoffs taken in the offensive zone
Pace - The average game pace, as estimated by all shot attempts per 60 minutes
Pass - An estimate of the player’s setup passes (passes that result in a shot attempt)
Pct% - Percentage of all events produced by this team, defaults to Corsi, but can be set to another stat
PDO - The team’s shooting and save percentages added together, times a thousand
PEND - The team’s penalties drawn while this player was on the ice
PENT - The team’s penalties taken while this player was on the ice
PIM - Penalties in minutes
Position - Positions played. NHL source listed first, followed by those listed by any other source.
Post - Times hit the post
Pr/St - Province or state of birth
PS - Point shares, a catch-all stat that measures a player’s contributions in points in the standings
PSA - Penalty shot attempts
PSG - Penalty shot goals
PTS - Points. Goals plus all assists
PTS/60 - Points per 60 minutes
QRelCA60 - Shot attempts allowed per 60 minutes relative to how others did against the same competition
QRelCF60 - Shot attempts per 60 minutes relative to how others did against the same competition
QRelDFA60 - Weighted unblocked shot attempts (Dangerous Fenwick) allowed per 60 minutes relative to how others did against the same competition
QRelDFF60 - Weighted unblocked shot attempts (Dangerous Fenwick) per 60 minutes relative to how others did against the same competition
RBA - Rebounds allowed while this player was on the ice. Two very different sources.
RBF - The team’s rebounds while this player was on the ice. Two very different sources.
RelA/60 - The player’s A/60 relative to the team when he’s not on the ice
RelC/60 - Corsi differential per 60 minutes relative to his team
RelC% - Corsi percentage relative to his team
RelDf/60 - The player’s Diff/60 relative to the team when he’s not on the ice
RelF/60 - The player’s F/60 relative to the team when he’s not on the ice
RelF/60 - Fenwick differential per 60 minutes relative to his team
RelF% - Fenwick percentage relative to his team
RelPct% - The player’s Pct% relative to the team when he’s not on the ice
RelZS% - The player’s zone start percentage when he’s on the ice relative to when he’s not.
RopFO - Opening faceoffs taken at home
RopFOW - Opening faceoffs won at home
RSA - Shots off the rush allowed while this player was on the ice
RSF - The team’s shots off the rush while this player was on the ice
S.Bkhd - Backhand shots
S.Dflct - Deflections
S.Slap - Slap shots
S.Snap - Snap shots
S.Tip - Tipped shots
S.Wrap - Wraparound shots
S.Wrst - Wrist shots
SA - Shots on goal allowed while this player was on the ice
Salary - The player’s salary
SCA - Scoring chances allowed while this player was on the ice
SCF - The team’s scoring chances while this player was on the ice
sDist - The average shot distance of shots taken by this player
SF - The team’s shots on goal while this player was on the ice
SH% - The team’s (not individual’s) shooting percentage when the player was on the ice
SOG - Shootout Goals
SOGDG - Game-deciding shootout goals
SOS - Shootout Shots
Status - This player’s free agency status
SV% - The team’s save percentage when the player was on the ice
Team -
TKA - The team’s takeaways while this player was on the ice
TMCA60 - A weighted average of the shot attempts (Corsi, SAT) the team allowed per 60 minutes of a player’s linemates
TMCF60 - A weighted average of the shot attempts (Corsi, SAT) the team generated per 60 minutes of a player’s linemates
TMFA60 - A weighted average of the unblocked shot attempts (Fenwick, USAT) the team allowed per 60 minutes of a player’s linemates
TMFF60 - A weighted average of the unblocked shot attempts (Fenwick, USAT) the team generated per 60 minutes of a player’s linemates
TMGA60 - A weighted average of the goals the team allowed per 60 minutes of a player’s linemates
TMGF60 - A weighted average of the goals the team scored per 60 minutes of a player’s linemates
TMSA60 - A weighted average of the shots on goal the team allowed per 60 minutes of a player’s linemates
TMSF60 - A weighted average of the shots on goal the team generated per 60 minutes of a player’s linemates
TmxGF - A weighted average of a player’s linemates of the expected goals the team scored
TmxGA - A weighted average of a player’s linemates of the expected goals the team allowed
TMGA - A weighted average of a player’s linemates of the goals the team scored
TMGF - A weighted average of a player’s linemates of the goals the team allowed
TOI - Time on ice, in minutes, or in seconds (NHL)
TOI.QoC - A weighted average of the TOI% of a player’s opponents.
TOI.QoT - A weighted average of the TOI% of a player’s linemates.
TOI/GP - Time on ice divided by games played
TOI% - Percentage of all available ice time assigned to this player.
Wide - Shots that went wide of the net
Wt - Weight
xGA - Expected goals allowed (weighted shots) while this player was on the ice, which is shot attempts weighted by location
xGF - The team’s expected goals (weighted shots) while this player was on the ice, which is shot attempts weighted by location
xGF.QoC - A weighted average of the expected goal percentage of a player’s opponents
xGF.QoT - A weighted average of the expected goal percentage of a player’s linemates
ZS% - Zone start percentage, the percentage of shifts started in the offensive zone, not counting neutral zone or on-the-fly changes
Renaming columns we will be using
colnames(nhl)[colnames(nhl) == "X..."] ="plus_minus"
colnames(nhl)[colnames(nhl) == "E..."] ="E_plus_minus"
colnames(nhl)[colnames(nhl) == "TOI."] ="TOI_pct"
colnames(nhl)[colnames(nhl) == "FO."] ="FO_pct"
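The odd names being replaced here (such as "X..." and "TOI.") are what read.csv produces by default, because it passes the CSV headers through make.names() to turn them into syntactically valid R names. A minimal sketch of that conversion, assuming the original headers match the acronyms in the legend above:

```r
# Headers as they appear in the legend (assumed to match the raw CSV header row)
raw_headers <- c("+/-", "E+/-", "TOI%", "FO%", "%FOT", "1G")

# make.names() replaces invalid characters with "." and prepends "X"
# when a name does not start with a letter
clean_headers <- make.names(raw_headers)
data.frame(raw = raw_headers, as_read = clean_headers)
```

Several different symbols collapse to the same ".", so giving these columns explicit names such as plus_minus and TOI_pct makes later code easier to read.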
We want to predict salary, so let’s rescale salary to millions of dollars, so that $1,000,000 is represented as 1, to make visualization easier. We will preserve the original Salary column.
nhl$Salary_scale <- nhl$Salary/1000000
We will view both salary columns side by side to check that this transformation did what we wanted.
salary_filtered <- nhl %>%
select(Salary,Salary_scale)
salary_filtered
team_count <- dplyr::count(nhl, Team, sort = TRUE)
team_count
The dataset lists traded players with all of their teams combined in the Team field, which yields 68 distinct Team values even though the league had 30 teams that season. Given the salary cap and CBA, a player’s salary doesn’t change with a trade, so we won’t include Team in our model.
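If the individual teams were ever needed, the combined values could be split apart. The sketch below uses a made-up vector and assumes the teams are joined with "/" (an assumption about the file's format, not something confirmed above):

```r
# Hypothetical Team values; the "/" separator is an assumption
teams <- c("CHI", "PIT", "ANA/TOR", "TOR/ARI")

# Combined strings count as their own distinct values
n_combined <- length(unique(teams))

# Splitting on the separator recovers the underlying single teams
single_teams <- unique(unlist(strsplit(teams, "/", fixed = TRUE)))
n_single <- length(single_teams)

c(distinct_values = n_combined, distinct_teams = n_single)
```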
Let’s visualize the range of salaries for the players included in this dataset since that is our target for modeling.
plot1<- ggplot(data = nhl, aes(Salary_scale))+
geom_histogram(fill = 'blue')+ scale_x_continuous(breaks = seq(1, 15, by = 1))
plot1
We can visualize the distribution of salaries vs the mean salary
plot1 + geom_vline(aes(xintercept = mean(Salary_scale)),
color = 'red', linetype = 'dashed')
mean_sal<- mean(nhl$Salary_scale)
round(mean_sal,3)
[1] 2.265
describe(nhl$Salary)
nhl$Salary
n missing distinct Info Mean Gmd .05 .10
612 0 138 0.997 2264509 2180437 575000 600000
.25 .50 .75 .90 .95
742500 925000 3500000 5490000 6862500
lowest : 575000 590000 595000 600000 615000
highest: 10000000 10900000 11000000 12000000 13800000
sd_sal <- sd(nhl$Salary_scale)
sd_sal
[1] 2.23634
The mean ($2.27 million) sits well above the median ($925,000), so most salaries fall below the mean. This right skew may distort the model when we try to fit it.
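A quick way to quantify this kind of skew is to compare the mean with the median and to compute a moment-based skewness. The sketch below reuses the mean and median from the describe() output above; the toy salary vector and the skewness() helper are illustrative only and are not part of the dataset or of any package used here.

```r
# Values copied from the describe() output above
mean_salary   <- 2264509
median_salary <- 925000
mean_salary / median_salary    # a ratio well above 1 points to a long right tail

# Hypothetical toy salaries in millions (not real data)
toy <- c(0.575, 0.6, 0.65, 0.75, 0.925, 1.2, 3.5, 5.5, 6.9, 13.8)

# Moment-based skewness: third central moment over the cubed standard deviation
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}
skewness(toy)                  # positive for a right-skewed sample
```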
ggplot(data = nhl) +
geom_point(aes(x = G, y = Salary_scale, color = G))+scale_color_gradient(low = 'purple', high = 'deeppink')
We can see a general positive relationship between goals scored and salary, but there may be more to predicting salary than just goals.
cor.test(nhl$G, nhl$Salary_scale)
Pearson's product-moment correlation
data: nhl$G and nhl$Salary_scale
t = 15.666, df = 610, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4766020 0.5898408
sample estimates:
cor
0.535625
ggplot (data = nhl) +
geom_point(aes(x = PTS, y = Salary_scale, color = PTS))
cor.test(nhl$PTS, nhl$Salary)
Pearson's product-moment correlation
data: nhl$PTS and nhl$Salary
t = 19.85, df = 610, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.5757775 0.6723210
sample estimates:
cor
0.6264459
Again there is a high frequency of data points at the low end, with fewer than 25 points and a low salary. What could be causing this? The league minimum salary in the season this dataset covers (2016-2017) was $575,000, so a player who appeared in only one NHL game that season would still be listed at the league-minimum rate. This helps explain the high proportion of salaries below $1 million when we visualized the distribution.
Let’s create new filtered datasets, one with players making above the league minimum and one with players making above $1 million, and visualize each.
nhl_1 <- nhl %>% #players making above league minimum
filter(Salary_scale > 0.575) # strict inequality; >= would keep every player
ggplot(nhl_1, aes(Salary_scale))+
geom_histogram(fill = 'red') + scale_x_continuous(breaks = seq(1, 15, by = 1))
There is a minor change, but the overall shape of the distribution is essentially the same.
nhl_2 <- nhl %>%
filter(Salary_scale > 1)
ggplot(nhl_2, aes(Salary_scale))+
geom_histogram(fill = 'green') + scale_x_continuous(breaks = seq(1, 15, by = 1))
This shows a marked change: the distribution looks closer to normal, though it still has a positive skew.
cor.test(nhl_2$PTS, nhl_2$Salary)
Pearson's product-moment correlation
data: nhl_2$PTS and nhl_2$Salary
t = 11.36, df = 277, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4779796 0.6389008
sample estimates:
cor
0.563767
We could also filter the dataset for players who played at least half of a season, or 41 games.
nhl_3<- nhl %>%
filter(GP >= 41)
ggplot(nhl_3, aes(Salary_scale))+
geom_histogram(bins = 30, fill = 1:30) + scale_x_continuous(breaks = seq(1, 15, by = 1)) # one fill color per bin, so bins must equal 30
Since we are going to predict salary, it’s best not to filter the data on salary itself, because selecting rows by the target variable would bias the model. So we will use the 41-games-played threshold for our dataset.
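To illustrate why filtering on the outcome is risky, here is a small simulation. Everything in it is made up (the seed, the sample size, the true slope of 0.07, and the cutoff of 3); it only shows the general effect that keeping rows by the response tends to flatten the fitted slope.

```r
set.seed(42)                                   # arbitrary seed for reproducibility
n   <- 5000
sim <- data.frame(pts = rpois(n, 30))          # simulated points totals
sim$sal <- 1 + 0.07 * sim$pts + rnorm(n, sd = 2)   # simulated salary in millions

full      <- lm(sal ~ pts, data = sim)
truncated <- lm(sal ~ pts, data = sim, subset = sal > 3)   # rows kept by the outcome

slope_full      <- unname(coef(full)["pts"])
slope_truncated <- unname(coef(truncated)["pts"])
c(full = slope_full, truncated = slope_truncated)
```

Filtering on a predictor-side quantity such as games played does not select rows by the outcome in this way.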
ggplot(nhl_3)+
geom_point(aes(x = PTS, y = Salary, color = PTS))
cor.test(nhl_3$PTS, nhl_3$Salary)
Pearson's product-moment correlation
data: nhl_3$PTS and nhl_3$Salary
t = 12.036, df = 402, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4391692 0.5829719
sample estimates:
cor
0.5146811
We can narrow this dataset to some of the most commonly measured statistics in hockey: position, team, games played, goals, assists, points, time on ice per game, penalty minutes, expected goals for, and point shares.
colnames(nhl_3)
[1] "Salary" "Born" "City" "Pr.St" "Cntry"
[6] "Nat" "Ht" "Wt" "DftYr" "DftRd"
[11] "Ovrl" "Hand" "Last.Name" "First.Name" "Position"
[16] "Team" "GP" "G" "A" "A1"
[21] "A2" "PTS" "X..." "E..." "PIM"
[26] "Shifts" "TOI" "TOIX" "TOI.GP" "TOI.GP.1"
[31] "TOI." "IPP." "SH." "SV." "PDO"
[36] "F.60" "A.60" "Pct." "Diff" "Diff.60"
[41] "iCF" "iCF.1" "iFF" "iSF" "iSF.1"
[46] "iSF.2" "ixG" "iSCF" "iRB" "iRS"
[51] "iDS" "sDist" "sDist.1" "Pass" "iHF"
[56] "iHF.1" "iHA" "iHDf" "iMiss" "iGVA"
[61] "iTKA" "iBLK" "iGVA.1" "iTKA.1" "iBLK.1"
[66] "BLK." "iFOW" "iFOL" "iFOW.1" "iFOL.1"
[71] "FO." "X.FOT" "dzFOW" "dzFOL" "nzFOW"
[76] "nzFOL" "ozFOW" "ozFOL" "FOW.Up" "FOL.Up"
[81] "FOW.Down" "FOL.Down" "FOW.Close" "FOL.Close" "OTG"
[86] "X1G" "GWG" "ENG" "PSG" "PSA"
[91] "G.Bkhd" "G.Dflct" "G.Slap" "G.Snap" "G.Tip"
[96] "G.Wrap" "G.Wrst" "CBar" "Post" "Over"
[101] "Wide" "S.Bkhd" "S.Dflct" "S.Slap" "S.Snap"
[106] "S.Tip" "S.Wrap" "S.Wrst" "iPenT" "iPenD"
[111] "iPENT" "iPEND" "iPenDf" "NPD" "Min"
[116] "Maj" "Match" "Misc" "Game" "CF"
[121] "CA" "FF" "FA" "SF" "SA"
[126] "xGF" "xGA" "SCF" "SCA" "GF"
[131] "GA" "RBF" "RBA" "RSF" "RSA"
[136] "DSF" "DSA" "FOW" "FOL" "HF"
[141] "HA" "GVA" "TKA" "PENT" "PEND"
[146] "OPS" "DPS" "PS" "OTOI" "Grit"
[151] "DAP" "Pace" "GS" "GS.G" "salary_scale"
[156] "Salary_scale"
nhl_4 <- nhl_3 %>%
arrange(desc(Salary)) %>%
select(Salary, Salary_scale, Last.Name, First.Name, Position, Team, GP, G, A, PTS, plus_minus, E_plus_minus, FO_pct, TOI.GP, TOI_pct, PIM, xGF, GF, xGA, GA, PS)
nhl_4
We can see that Jonathan Toews and Patrick Kane were the highest paid players that season with salaries of $13.8 million, scoring 58 and 89 points, respectively. Who was the leading points scorer that season?
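One way to answer this is to sort by points and keep everyone tied for the lead. The sketch below uses a toy data frame holding only the three players and figures named in this section, not the full dataset:

```r
# Toy rows only: salaries and point totals quoted in the text
toy <- data.frame(
  Last.Name = c("Toews", "Kane", "Crosby"),
  Salary    = c(13800000, 13800000, 10900000),
  PTS       = c(58, 89, 89)
)

# Sort by points, highest first, then keep every player tied for the maximum
ranked  <- toy[order(-toy$PTS), ]
leaders <- ranked[ranked$PTS == max(ranked$PTS), ]
leaders
```

On the real data, the same idea could be written with dplyr as nhl_4 %>% slice_max(PTS, n = 1), which keeps ties by default.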
So Patrick Kane, one of the two highest-paid players, tied Sidney Crosby (who made $10.9 million) for the most points in this dataset with 89. Let’s see which players scored more points than Jonathan Toews, who had 58.
more_than_toews<- nhl_4 %>%
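# look Toews up by name rather than by row position, which depends on how salary ties are ordered
filter(PTS > PTS[Last.Name == "Toews" & First.Name == "Jonathan"]) %>%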
arrange(desc(PTS))
more_than_toews
We can see that even though Toews was tied as the highest-paid player that year, 35 players had more points than he did that season.
Toews is known for being a very good faceoff player, so let’s see how his faceoff percentage ranked that season.
faceoff<- nhl_4 %>%
filter(Position == 'C') %>%
arrange(desc(FO_pct)) %>%
select(Salary, Last.Name, First.Name, Team, FO_pct)
faceoff
It’s important to note that the statistics selected describe skaters. Goalies are evaluated very differently, so we should filter any goaltenders out of this dataset before we build our model.
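A minimal sketch of such a filter, on a toy data frame. It assumes goalies would be coded as "G" in Position, which is an assumption; this file may contain no goalies at all, in which case the filter changes nothing.

```r
# Toy rows with hypothetical names; the "G" code for goalies is an assumption
toy <- data.frame(
  Last.Name = c("Skater1", "Skater2", "Goalie1"),
  Position  = c("C/LW", "D", "G"),
  stringsAsFactors = FALSE
)

# Keep every row whose position is not goalie
skaters <- toy[toy$Position != "G", ]
table(skaters$Position)
```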
We can group by position to find some salary information, but should separate by forward vs defense, as many forward positions are listed as “LW/RW” or “C/LW” and we don’t want those treated as separate groups.
nhl_4<- nhl_4 %>%
mutate(fwd_or_d = ifelse(Position == 'D', 'D','Forward'))
nhl_4
We can use this to group by forwards and defense to find mean, max, and minimum salaries amongst the 404 players with 41 or more games played.
max_sal<- nhl_4 %>%
group_by(fwd_or_d) %>%
summarise(max_salary = max(Salary), mean_salary = mean(Salary), min_salary = min(Salary))
max_sal
It makes sense for the minimum salaries to be equal for both groups given the league minimum salary. Interestingly, forwards have a higher maximum salary but lower mean salary.
We can find the maximum, mean, and minimum salaries by team:
max_sal_team<- nhl_4 %>%
group_by(Team) %>%
summarise(max_salary = max(Salary), mean_salary = mean(Salary), min_salary = min(Salary))
max_sal_team
Let’s examine the variance and standard deviation in salary within position groups:
var_salary<- nhl_4 %>%
group_by(fwd_or_d) %>%
summarise(salary_variance = var(Salary), salary_sd = sd(Salary))
var_salary
Before building our model, let’s check for predictors that are highly correlated with one another so we can avoid using both. A redundant predictor adds little new information, makes the coefficient estimates less stable, and can raise R-squared without really improving the model.
We know that points includes assists and goals, so those values should be highly correlated.
cor.test(nhl_4$PTS, nhl_4$G)
Pearson's product-moment correlation
data: nhl_4$PTS and nhl_4$G
t = 36.094, df = 402, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8490321 0.8953814
sample estimates:
cor
0.8741833
cor.test(nhl_4$PTS, nhl_4$A)
Pearson's product-moment correlation
data: nhl_4$PTS and nhl_4$A
t = 53.001, df = 402, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9218715 0.9465050
sample estimates:
cor
0.9353122
So using just points in our model should be sufficient.
biserial(nhl_4$PTS, nhl_4$fwd_or_d)
[,1]
[1,] 0.3367936
Points has a much lower correlation with position, so it may be acceptable to include forward vs defense in our model.
cor.test(nhl_4$PTS, nhl_4$xGF)
Pearson's product-moment correlation
data: nhl_4$PTS and nhl_4$xGF
t = 26.539, df = 402, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7594520 0.8307868
sample estimates:
cor
0.797896
There is also a high correlation between points and expected goals for (xGF), which represents the expected number of goals scored by a player’s team while that player is on the ice. I expect this will also be highly correlated with point shares:
cor.test(nhl_4$PS, nhl_4$xGF)
Pearson's product-moment correlation
data: nhl_4$PS and nhl_4$xGF
t = 34.599, df = 402, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8384312 0.8878377
sample estimates:
cor
0.8652198
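The pairwise checks above can also be run in one step with a correlation matrix, and the overlap between predictors can be summarized with a variance inflation factor (VIF). The sketch below runs on simulated data (the seed, distributions, and coefficients are all made up) rather than on nhl_4:

```r
set.seed(1)                              # arbitrary seed
n   <- 200
G   <- rpois(n, 15)                      # simulated goals
A   <- rpois(n, 25)                      # simulated assists
PTS <- G + A                             # points are goals plus assists
xGF <- 0.8 * PTS + rnorm(n, sd = 5)      # simulated on-ice expected goals
sim <- data.frame(G, A, PTS, xGF)

round(cor(sim), 2)                       # all pairwise Pearson correlations at once

# VIF for PTS when xGF is the other predictor:
# VIF = 1 / (1 - R^2), where R^2 comes from regressing PTS on the other predictors
r2      <- summary(lm(PTS ~ xGF, data = sim))$r.squared
vif_pts <- 1 / (1 - r2)
vif_pts
```

For a fitted model, the car package loaded at the top of this notebook offers vif(), which reports the same quantity for each predictor.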
mod_1<- lm(Salary~ PTS, data = nhl_4)
summary(mod_1)
Call:
lm(formula = Salary ~ PTS, data = nhl_4)
Residuals:
Min 1Q Median 3Q Max
-5178968 -1303655 -479053 1234023 8962796
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 970242 195590 4.961 1.04e-06 ***
PTS 66672 5539 12.036 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2089000 on 402 degrees of freedom
Multiple R-squared: 0.2649, Adjusted R-squared: 0.2631
F-statistic: 144.9 on 1 and 402 DF, p-value: < 2.2e-16
Here the model has a significant p-value and an adjusted R-squared of 0.2631, so points alone explain roughly 26% of the variance in salary.
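To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below plugs in the intercept and slope from the summary above, using 58 and 89 points (the totals quoted earlier for Toews and Kane) as example inputs:

```r
# Coefficients copied from the summary(mod_1) output above
intercept <- 970242
slope_pts <- 66672     # estimated dollars of salary per additional point

pts  <- c(58, 89)
pred <- intercept + slope_pts * pts
pred                   # predicted salaries in dollars
```

Both predictions (about $4.8 million and $6.9 million) fall far short of the $13.8 million those two players were paid, in line with the modest R-squared.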
mod_2 <- lm(Salary ~ PTS + plus_minus, data = nhl_4)
summary(mod_2)
Call:
lm(formula = Salary ~ PTS + plus_minus, data = nhl_4)
Residuals:
Min 1Q Median 3Q Max
-5171634 -1325591 -480902 1247097 8955043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 993119 201229 4.935 1.18e-06 ***
PTS 65878 5775 11.407 < 2e-16 ***
plus_minus 4418 8987 0.492 0.623
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2091000 on 401 degrees of freedom
Multiple R-squared: 0.2653, Adjusted R-squared: 0.2617
F-statistic: 72.42 on 2 and 401 DF, p-value: < 2.2e-16
The adjusted R-squared is slightly lower; PTS remains significant, but plus-minus is not.
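print(AIC(mod_1))
[1] 12908.54
print(AIC(mod_2))
[1] 12910.29
In AIC(), k is the penalty per estimated parameter rather than the number of predictors, so both models use the default k = 2 to keep the values comparable. As expected, the lower AIC corresponds to the first model, which is the better fit of the two.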
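For reference, AIC is minus twice the log-likelihood plus k times the number of estimated parameters, so two models are only comparable when they share the same k. A minimal sketch using the built-in mtcars data as a stand-in for the NHL data; manual_aic() is a helper written for this illustration:

```r
# mtcars ships with base R; it stands in for the NHL data here
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + qsec, data = mtcars)

# AIC = -2 * logLik + k * (number of estimated parameters), with k = 2 by default
manual_aic <- function(m, k = 2) {
  ll <- logLik(m)
  -2 * as.numeric(ll) + k * attr(ll, "df")
}

c(AIC(m1), manual_aic(m1))   # the built-in and manual values agree
AIC(m1, m2)                  # same k for both models, so the values are comparable
```

Note that the parameter count includes the residual variance, so a model with an intercept and one slope has three estimated parameters.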
Expected GF and expected GA can describe how a player’s team is performing with that player on the ice, even if the player doesn’t directly influence the goals occurring. We can model Salary on xGF and GF, and xGA and GA to see if expected versus actual stats prove to be more significant predictors.
model_GF<- lm(Salary ~xGF+ GF, data = nhl_4)
summary(model_GF)
Call:
lm(formula = Salary ~ xGF + GF, data = nhl_4)
Residuals:
Min 1Q Median 3Q Max
-5045470 -1158864 -264927 1015947 9486805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -207141 262087 -0.790 0.42979
xGF 39222 13842 2.834 0.00483 **
GF 19826 11963 1.657 0.09825 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1986000 on 401 degrees of freedom
Multiple R-squared: 0.337, Adjusted R-squared: 0.3337
F-statistic: 101.9 on 2 and 401 DF, p-value: < 2.2e-16
In this case, expected goals for (p = 0.005) is a more significant predictor than actual goals for (p = 0.098), which is significant only at the 0.1 level. We can evaluate the same thing with goals against:
model_GA <- lm(Salary ~ xGA + GA, data = nhl_4)
summary(model_GA)
Call:
lm(formula = Salary ~ xGA + GA, data = nhl_4)
Residuals:
Min 1Q Median 3Q Max
-4410390 -1479158 -485561 1133690 9898661
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   131961     334249   0.395   0.6932
xGA            23035      14333   1.607   0.1088
GA             33007      13954   2.365   0.0185 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2219000 on 401 degrees of freedom
Multiple R-squared: 0.1724, Adjusted R-squared: 0.1683
F-statistic: 41.77 on 2 and 401 DF, p-value: < 2.2e-16
In this case, neither expected goals against nor the intercept is significant, and goals against is significant only at the 0.05 level. The adjusted R-squared (0.1683) is also about half that of the goals-for model.
mod_3 <- lm(Salary ~ PTS + xGF, data = nhl_4)
summary(mod_3)
Call:
lm(formula = Salary ~ PTS + xGF, data = nhl_4)
Residuals:
Min 1Q Median 3Q Max
-5275735 -1191225 -270731 1069698 9327346
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -223570     255579  -0.875    0.382
PTS            19470       8714   2.234    0.026 *
xGF            48332       7119   6.789 4.08e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1981000 on 401 degrees of freedom
Multiple R-squared: 0.3407, Adjusted R-squared: 0.3374
F-statistic: 103.6 on 2 and 401 DF, p-value: < 2.2e-16
In these models, xGF, PTS, and GA all show significant effects on the dependent variable, Salary. We can create a model with all three:
mod_4 <- lm(Salary ~ PTS + GA + xGF, data = nhl_4)
summary(mod_4)
Call:
lm(formula = Salary ~ PTS + GA + xGF, data = nhl_4)
Residuals:
Min 1Q Median 3Q Max
-5261703 -1169247 -263778 1043683 9253499
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -321384     288049  -1.116   0.2652
PTS            21714       9234   2.352   0.0192 *
GA              5878       7968   0.738   0.4611
xGF            43479       9697   4.484 9.59e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1982000 on 400 degrees of freedom
Multiple R-squared: 0.3416, Adjusted R-squared: 0.3366
F-statistic: 69.17 on 3 and 400 DF, p-value: < 2.2e-16
When all three variables are included, GA loses significance. A likely explanation is collinearity: if xGF and GA are fairly strongly correlated, they carry overlapping information, and xGF absorbs most of the shared explanatory power when both are in the model. We can check the correlation directly:
cor.test(nhl_4$xGF, nhl_4$GA)
Pearson's product-moment correlation
data: nhl_4$xGF and nhl_4$GA
t = 20.264, df = 402, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6589918 0.7559865
sample estimates:
cor
0.7108527
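A correlation like this can also be expressed as a variance inflation factor (VIF). For a model with only two predictors, VIF = 1 / (1 - r^2), so r = 0.7109 would correspond to a VIF of roughly 2.02. With more predictors, each VIF comes from regressing that predictor on all the others. The sketch below shows the calculation in base R on simulated data; the data frame `sim`, its columns, and the helper `manual_vif` are hypothetical names used only for illustration.

```r
# Simulated data for illustration only (not the NHL data).
set.seed(1)
n  <- 400
x1 <- rnorm(n)
x2 <- 0.7 * x1 + sqrt(1 - 0.7^2) * rnorm(n)  # built to correlate with x1 (about 0.7)
x3 <- rnorm(n)                               # independent of x1 and x2
y  <- 1 + 2 * x1 + 0.5 * x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# VIF for one predictor: 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors.
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

manual_vif(sim, c("x1", "x2", "x3"))
```

Since the `car` package is already loaded in this analysis, its `vif()` function applied to a fitted model should give the same kind of diagnostic without the manual step.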
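We can confirm that model 3, with fewer predictors, is the better model by comparing AIC. Notably, both of these models have a lower AIC, and therefore a better fit, than the regression of Salary on points alone.
# k is the penalty per estimated parameter (default 2), not the number of predictors,
# so both models must be scored with the same value
AIC(mod_4)
[1] 12868.03
AIC(mod_3)
[1] 12866.58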
Lastly, we can examine the relationship between Salary and just xGF:
mod_xGF <- lm(Salary ~ xGF, data = nhl_4)
summary(mod_xGF)
Call:
lm(formula = Salary ~ xGF, data = nhl_4)
Residuals:
Min 1Q Median 3Q Max
-4968018 -1205154 -240009 1072044 9621790
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -325448     252724  -1.288    0.199
xGF            61025       4313  14.150   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1990000 on 402 degrees of freedom
Multiple R-squared: 0.3325, Adjusted R-squared: 0.3308
F-statistic: 200.2 on 1 and 402 DF, p-value: < 2.2e-16
AIC(mod_xGF)
[1] 12869.58
In summary, of the selected predictors and combinations, xGF and PTS together in a multiple regression are the best predictors of Salary, outperforming either predictor alone.
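When several candidate models are in play, it can help to collect their fit statistics in one table instead of reading them off separate summaries. The sketch below shows one way to do that, using the built-in `mtcars` data as a stand-in; the model names and the `comparison` data frame are hypothetical.

```r
# Illustration on mtcars (not the NHL data); model names are hypothetical.
models <- list(
  wt_only   = lm(mpg ~ wt, data = mtcars),
  hp_only   = lm(mpg ~ hp, data = mtcars),
  wt_and_hp = lm(mpg ~ wt + hp, data = mtcars)
)

# One row per model: adjusted R-squared and AIC (default penalty k = 2).
comparison <- data.frame(
  model  = names(models),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC),
  row.names = NULL
)

# Sort so the lowest-AIC (preferred) model comes first.
comparison[order(comparison$aic), ]
```

The same pattern would apply to a named list of the Salary models, provided they are all fit to the same rows of data.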
We can view the relationship between PTS, xGF, and scaled Salary in a 3D plot:
plot_ly(x = nhl_4$PTS, y = nhl_4$Salary_scale, z = nhl_4$xGF, color = nhl_4$PTS) %>%
  layout(scene = list(xaxis = list(title = 'Points'),
                      yaxis = list(title = 'Salary in Millions'),
                      zaxis = list(title = 'Expected Goals For')))
No trace type specified:
Based on info supplied, a 'scatter3d' trace seems appropriate.
Read more about this trace type -> https://plotly.com/r/reference/#scatter3d
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
The relationships can also be viewed in two dimensions with added-variable plots from the ‘car’ library. In each panel, the plotted line shows the relationship between the given predictor and the response variable after adjusting for the other predictor.
avPlots(mod_3)
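An added-variable plot can be reproduced by hand, which may make it clearer what the plotted line represents. The sketch below uses base R and the built-in `mtcars` data rather than the NHL data; `full`, `res_y`, `res_x`, and `av_fit` are hypothetical names for this illustration.

```r
# Illustration on mtcars (not the NHL data).
full <- lm(mpg ~ wt + hp, data = mtcars)

# Added-variable plot for wt, adjusting for hp:
# 1. residuals of the response after regressing it on the other predictor
# 2. residuals of wt after regressing it on the other predictor
res_y <- resid(lm(mpg ~ hp, data = mtcars))
res_x <- resid(lm(wt ~ hp, data = mtcars))

plot(res_x, res_y,
     xlab = "wt | hp", ylab = "mpg | hp",
     main = "Added-variable plot for wt")
av_fit <- lm(res_y ~ res_x)
abline(av_fit)

# The slope of this line equals the wt coefficient in the full model.
coef(av_fit)["res_x"]
coef(full)["wt"]
```

The slope in each panel therefore matches the corresponding coefficient in the multiple regression, so the panels show the partial effects the model summary reports.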